We present a robust, privacy-preserving visual localization algorithm using event cameras. While event cameras can potentially enable robust localization thanks to their high dynamic range and low motion blur, the sensors exhibit a large domain gap that makes it difficult to directly apply conventional image-based localization algorithms. To mitigate the gap, we propose applying event-to-image conversion prior to localization, which leads to stable localization. From the privacy perspective, event cameras capture only a fraction of the visual information captured by normal cameras and thus naturally hide sensitive visual details. To further enhance privacy protection in our event-based pipeline, we introduce protection at two levels, namely the sensor and network levels. Sensor-level protection aims to hide facial details with lightweight filtering, while network-level protection targets hiding the user's entire view in private-scene applications using a novel neural network inference pipeline. Both levels of protection involve lightweight computation and incur only a small performance loss. We therefore expect our method to serve as a building block for practical location-based services using event cameras. The code and dataset will be made public through the following link: https://github.com/82magnolia/event_localization.
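To make the event-to-image step concrete, here is a minimal sketch of accumulating a chunk of events into a frame that a conventional localization network could consume. The (x, y, t, polarity) event layout and the simple signed-count accumulation are assumptions for illustration; the paper's actual conversion uses a learned reconstruction network.

```python
import numpy as np

def events_to_frame(events, height, width):
    """Accumulate an (N, 4) array of (x, y, t, polarity) events into a 2D frame.
    This signed-count accumulation is a simplified stand-in for learned
    event-to-image conversion."""
    frame = np.zeros((height, width), dtype=np.float32)
    xs = events[:, 0].astype(int)
    ys = events[:, 1].astype(int)
    pol = np.where(events[:, 3] > 0, 1.0, -1.0)
    np.add.at(frame, (ys, xs), pol)      # signed event count per pixel
    # Normalize to [0, 1] so an image-based localization network can consume it.
    frame -= frame.min()
    if frame.max() > 0:
        frame /= frame.max()
    return frame

# Example usage with random synthetic events.
rng = np.random.default_rng(0)
ev = np.stack([rng.integers(0, 640, 1000), rng.integers(0, 480, 1000),
               rng.random(1000), rng.integers(0, 2, 1000)], axis=1).astype(np.float32)
img = events_to_frame(ev, height=480, width=640)
```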
Current computer vision models, unlike the human visual system, cannot yet achieve general-purpose visual understanding. Existing efforts to create a general vision model are limited in the scope of assessed tasks and offer no overarching framework to perform them holistically. We present a new comprehensive benchmark, General-purpose Visual Understanding Evaluation (G-VUE), covering the full spectrum of visual cognitive abilities across four functional domains: Perceive, Ground, Reason, and Act. The four domains are embodied in 11 carefully curated tasks, from 3D reconstruction to visual reasoning and manipulation. Along with the benchmark, we provide a general encoder-decoder framework that allows arbitrary visual representations to be evaluated on all 11 tasks. We evaluate various pre-trained visual representations with our framework and observe that (1) Transformer-based visual backbones generally outperform CNN-based backbones on G-VUE, and (2) visual representations from vision-language pre-training are superior to those from vision-only pre-training across visual tasks. With G-VUE, we provide a holistic evaluation standard to motivate research toward building general-purpose visual systems via obtaining more general-purpose visual representations.
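The following is a minimal sketch of the encoder-decoder evaluation idea: a (possibly frozen) visual backbone paired with a lightweight task-specific head. The class name, feature dimensions, and the linear head are placeholders, not G-VUE's exact decoders.

```python
import torch
import torch.nn as nn

class EncoderDecoderProbe(nn.Module):
    """Illustrative wrapper in the spirit of an encoder-decoder evaluation
    protocol: a frozen visual backbone plus a small task head."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_outputs: int,
                 freeze_backbone: bool = True):
        super().__init__()
        self.backbone = backbone
        if freeze_backbone:
            for p in self.backbone.parameters():
                p.requires_grad = False
        self.head = nn.Linear(feat_dim, num_outputs)   # task-specific decoder

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)                  # (B, feat_dim) pooled features
        return self.head(feats)

# Toy usage: a dummy backbone producing 512-d features for a 10-way task.
dummy_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
probe = EncoderDecoderProbe(dummy_backbone, feat_dim=512, num_outputs=10)
logits = probe(torch.randn(4, 3, 32, 32))
```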
Visual commonsense understanding requires vision-language (VL) models not only to understand image and text but also to cross-reference between them to fully integrate and comprehend the visual scene described. Recently, various approaches have been developed and have achieved high performance on visual commonsense benchmarks. However, it is unclear whether the models truly understand the visual scene and the underlying commonsense knowledge, due to limited evaluation data resources. To provide an in-depth analysis, we present a Multimodal Evaluation (ME) pipeline that automatically generates question-answer pairs to test models' understanding of the visual scene, the text, and related knowledge. We then take a step further and show that training with the ME data boosts the model's performance in standard VCR evaluation. Lastly, our in-depth analysis and comparison reveal interesting findings: (1) semantically low-level information can assist the learning of high-level information, but not the opposite; (2) visual information is generally under-utilized compared with text.
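As a rough illustration of automatic question-answer generation for probing scene understanding, here is a toy rule-based sketch. The templates, inputs, and function name are hypothetical and much simpler than the ME pipeline itself.

```python
def generate_probe_questions(objects, attributes):
    """Toy, rule-based QA generation from scene annotations: `objects` is a
    list of object names and `attributes` maps object -> attribute value."""
    qa_pairs = []
    for obj in objects:
        qa_pairs.append((f"Is there a {obj} in the image?", "yes"))
        if obj in attributes:
            qa_pairs.append((f"What color is the {obj}?", attributes[obj]))
    return qa_pairs

print(generate_probe_questions(["car", "dog"], {"car": "red"}))
```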
Various depth estimation models are now widely used on many mobile and IoT devices for image segmentation, bokeh effect rendering, object tracking, and many other mobile tasks. It is therefore crucial to have efficient and accurate depth estimation models that can run fast on low-power mobile chipsets. In this Mobile AI challenge, the target was to develop deep learning-based single-image depth estimation solutions that can show real-time performance on IoT platforms and smartphones. For this, the participants used a large-scale RGB-to-depth dataset that was collected with a ZED stereo camera capable of generating depth maps for objects located up to 50 meters away. The runtime of all models was evaluated on the Raspberry Pi 4 platform, where the developed solutions were able to generate VGA-resolution depth maps at up to 27 FPS while achieving high-fidelity results. All models developed in the challenge are also compatible with any Android or Linux-based mobile device; their detailed descriptions are provided in this paper.
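For a sense of how per-frame throughput of such a model can be measured, here is a hedged timing sketch. The tiny convolutional network, the PyTorch runtime, and the loop count are placeholders; the challenge itself evaluated deployed models on a Raspberry Pi 4, which differs from this stand-in.

```python
import time
import torch
import torch.nn as nn

depth_net = nn.Sequential(                       # tiny placeholder depth network
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
depth_net.eval()

frame = torch.randn(1, 3, 480, 640)              # VGA-resolution input
with torch.no_grad():
    start = time.perf_counter()
    for _ in range(10):
        _ = depth_net(frame)
    elapsed = time.perf_counter() - start
print(f"~{10 / elapsed:.1f} FPS on this machine")
```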
Picking the first arrival times of prestack gathers, known as first arrival time (FAT) picking, is an essential step in seismic data processing and is still mostly performed manually. With the increasing density of current seismic data acquisition, the efficiency of manual picking cannot meet practical needs. Automatic picking methods have therefore been developed extensively in recent decades, especially deep-learning-based methods. However, current supervised deep-learning-based methods can rarely avoid dependence on labeled samples. Moreover, since gather data are a set of signals quite different from natural images, current methods struggle to solve the FAT picking problem in low signal-to-noise ratio (SNR) situations. In this paper, targeting hard-rock seismic gather data, we propose a Multi-Stage Segmentation Picking Network (MSSPN) that addresses both the generalization problem across work sites and the picking problem in low-SNR situations. MSSPN consists of four sub-models that emulate the manual picking process, cast as four stages from coarse to fine. Experiments on seven field datasets of varying quality show that MSSPN outperforms the benchmarks by a large margin. In particular, our method achieves more than 90% accurate picking in medium- and high-SNR cases, and even the fine model achieves 88% accurate picking on low-SNR datasets.
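To illustrate the coarse-to-fine idea, the sketch below chains simple stage functions that each refine the previous estimate on a narrower window. The energy-based toy stages and all names are placeholders for MSSPN's four learned sub-models.

```python
import numpy as np

def coarse_to_fine_pick(gather, stages):
    """Schematic coarse-to-fine picking: each stage refines the previous
    estimate. `stages` are placeholders for MSSPN's learned sub-models."""
    estimate = None
    for stage in stages:
        estimate = stage(gather, estimate)
    return estimate

# Toy stages: a global energy-based guess, then a local refinement around it.
def coarse_stage(img, _):
    return int(np.argmax(np.abs(img).sum(axis=1)))           # rough time index

def refine_stage(img, prev):
    lo, hi = max(prev - 5, 0), min(prev + 5, img.shape[0])
    window = np.abs(img[lo:hi]).sum(axis=1)
    return lo + int(np.argmax(window))

picked = coarse_to_fine_pick(np.random.rand(200, 64), [coarse_stage, refine_stage])
```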
Flash illumination is widely used for imaging in low-light environments. However, the illumination intensity falls off quadratically with propagation distance, which poses a significant challenge for long-distance flash imaging. We propose a new flash technique, named "patterned flash", for long-distance flash imaging. Patterned flash concentrates optical power into a dot array. Compared with the conventional uniform flash, where the signal is overwhelmed by noise everywhere, patterned flash provides a stronger signal at sparsely distributed points across the field of view, ensuring that the signal at these points stands out from the sensor noise. This enables post-processing to resolve important objects and details. Moreover, the patterned flash projects texture onto the scene and can therefore be treated as a structured-light system for depth perception. Given the novel system, we develop a joint image reconstruction and depth estimation algorithm using a convolutional neural network. We build a hardware prototype and test the proposed flash technique on various scenes. Experimental results show that our patterned flash performs significantly better at long distances in low-light environments.
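The back-of-the-envelope sketch below shows why concentrating a fixed power budget into a sparse dot array raises the per-pixel signal above the noise floor at long range. All numbers (power, distance, read noise, dot spacing) are purely illustrative assumptions.

```python
import numpy as np

H, W, power, dist, read_noise = 128, 128, 1e6, 10.0, 5.0

# Uniform flash: power spread over every pixel, with quadratic distance falloff.
uniform_signal = power / (dist ** 2) / (H * W)            # photons per pixel

# Patterned flash: the same power concentrated onto a sparse dot array.
dots = np.zeros((H, W))
dots[::16, ::16] = 1.0
n_dots = int(dots.sum())
dot_signal = power / (dist ** 2) / n_dots                 # photons per dot pixel

print(f"uniform SNR ~ {uniform_signal / read_noise:.2f}, "
      f"dot SNR ~ {dot_signal / read_noise:.2f}")
```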
Structured light (SL) systems acquire high-fidelity 3D geometry with active illumination projection. Conventional systems face challenges when operating in environments with strong ambient illumination, global illumination, and cross-device interference. This paper proposes a general technique to improve the robustness of SL by projecting redundant optical signals in addition to the native SL patterns. In this way, the projected signals become more distinguishable from errors, so geometric information can be recovered more easily with simple signal processing, yielding a "coding gain" in performance. We propose three applications of the redundant codes: (1) self error-correction for SL imaging under strong ambient light, (2) error detection for adaptive reconstruction under global illumination, and (3) interference filtering with device-specific projection sequence encoding, especially for event-camera-based SL and light-curtain devices. We systematically analyze the design rules and signal processing algorithms for these applications. Corresponding hardware prototypes are built for evaluation on real-world complex scenes. Experimental results on both synthetic and real data demonstrate significant performance improvements for SL systems with redundant codes.
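As a generic illustration of the coding-gain idea (not the paper's specific code design), the sketch below appends a parity pattern to a stack of binary SL patterns and flags decoded pixels whose parity check fails.

```python
import numpy as np

def add_parity_pattern(code_stack):
    """Append a redundant parity pattern to a stack of binary SL patterns
    (num_patterns x H x W)."""
    parity = np.bitwise_xor.reduce(code_stack.astype(np.uint8), axis=0)
    return np.concatenate([code_stack.astype(np.uint8), parity[None]], axis=0)

def detect_errors(decoded_stack):
    """Return a mask of pixels whose parity check fails after decoding."""
    payload, parity = decoded_stack[:-1], decoded_stack[-1]
    return np.bitwise_xor.reduce(payload, axis=0) != parity

patterns = (np.random.rand(4, 8, 8) > 0.5)
augmented = add_parity_pattern(patterns)
corrupted = augmented.copy()
corrupted[0, 0, 0] ^= 1                        # flip one decoded bit
print(detect_errors(corrupted)[0, 0])          # True: error flagged at that pixel
```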
Autonomous driving systems require a good understanding of the surrounding environment, including moving obstacles and static high-definition (HD) semantic maps. Existing methods approach the semantic map problem with offline manual annotation, which suffers from serious scalability issues. Recent learning-based methods produce dense rasterized segmentation predictions that contain no instance information for individual map elements and require heuristic post-processing with many hand-designed components to obtain vectorized maps. To this end, we introduce an end-to-end vectorized HD map learning pipeline, termed VectorMapNet. VectorMapNet takes onboard sensor observations and predicts a sparse set of polylines in bird's-eye view to model the geometry of HD maps. Based on this pipeline, our method can explicitly model the spatial relations between map elements and generate vectorized maps that are friendly to downstream autonomous driving tasks without post-processing. In our experiments, VectorMapNet achieves strong HD map learning performance on the nuScenes dataset, surpassing previous state-of-the-art methods by 14.2 mAP. Qualitatively, we also show that VectorMapNet can generate comprehensive maps and capture more fine-grained road geometry details. To the best of our knowledge, VectorMapNet is the first work designed for the end-to-end vectorized HD map learning problem.
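Below is a minimal sketch of what a vectorized map output looks like: a fixed-size set of polylines (point sequences) plus a class per element, decoded from a bird's-eye-view feature. The sizes and the simple linear heads are placeholders, not VectorMapNet's actual decoder.

```python
import torch
import torch.nn as nn

class PolylineHead(nn.Module):
    """Illustrative head that decodes a sparse set of BEV polylines and
    per-element classes from a pooled BEV feature."""

    def __init__(self, bev_dim=256, num_elems=20, pts_per_elem=10, num_classes=3):
        super().__init__()
        self.num_elems, self.pts_per_elem = num_elems, pts_per_elem
        self.point_head = nn.Linear(bev_dim, num_elems * pts_per_elem * 2)
        self.class_head = nn.Linear(bev_dim, num_elems * num_classes)

    def forward(self, bev_feat):                  # (B, bev_dim) pooled BEV feature
        b = bev_feat.shape[0]
        pts = self.point_head(bev_feat).view(b, self.num_elems, self.pts_per_elem, 2)
        cls = self.class_head(bev_feat).view(b, self.num_elems, -1)
        return pts, cls                           # polyline vertices + element classes

pts, cls = PolylineHead()(torch.randn(2, 256))
```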
In theory, the accuracy of computational models for compound-protein binding affinity (CPA) can be improved by incorporating protein 3D structure information. However, most such models still suffer from low accuracy due to the lack of effective methods for encoding informative protein features. The main challenge is how to combine multimodal information such as the residue sequence, residue atom coordinates, and torsion angles of the protein. To address this problem, we develop a fast evolutional-attention and thorough graph neural network (FeatNN) to facilitate the use of protein 3D structure information in CPA prediction. Specifically, we establish a novel end-to-end architecture that jointly embeds the torsion matrix, discrete distance matrix, and sequence information of the protein, and extracts compound features with deep graph convolution layers. Moreover, a new pairwise mapping attention mechanism is introduced to comprehensively learn the potential interaction information between the protein and the compound. FeatNN considerably outperforms various state-of-the-art baselines in CPA prediction, with the R2 coefficient elevated by approximately 21.33%. FeatNN thus provides an excellent method for highly accurate CPA prediction and facilitates high-throughput virtual screening of drug candidates.
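The sketch below illustrates one way a pairwise interaction module between protein residue features and compound atom features could be wired up; the dimensions, scaled-dot-product attention, and pooling are illustrative assumptions, not FeatNN's exact formulation.

```python
import torch
import torch.nn as nn

class PairwiseMappingAttention(nn.Module):
    """Hedged sketch of a residue-atom pairwise attention block followed by a
    small affinity regression head."""

    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.out = nn.Linear(2 * dim, 1)

    def forward(self, res_feat, atom_feat):
        # res_feat: (B, R, dim) residue features; atom_feat: (B, A, dim) atom features
        attn = torch.softmax(self.q(res_feat) @ self.k(atom_feat).transpose(1, 2)
                             / res_feat.shape[-1] ** 0.5, dim=-1)   # (B, R, A)
        pooled_protein = (attn @ atom_feat).mean(dim=1)             # (B, dim)
        pooled_compound = atom_feat.mean(dim=1)                     # (B, dim)
        return self.out(torch.cat([pooled_protein, pooled_compound], dim=-1))

affinity = PairwiseMappingAttention()(torch.randn(2, 100, 64), torch.randn(2, 30, 64))
```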
Visual content creation has spurred soaring interest in mobile photography and AR/VR. Style transfer and single-image 3D photography, as two representative tasks, have so far evolved independently. In this paper, we make a connection between the two and address the challenging task of 3D photo stylization: generating stylized novel views from a single image given an arbitrary style. Our key intuition is that style transfer and view synthesis must be jointly modeled for this task. To this end, we propose a deep model that learns geometry-aware content features for stylization from a point cloud representation of the scene, resulting in high-quality stylized images that are consistent across views. Moreover, we introduce a novel training protocol that enables learning using only 2D images. We demonstrate the superiority of our method via extensive qualitative and quantitative studies, and showcase key applications of our method in light of the growing demand for 3D content creation from 2D image assets.
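As a generic stylization building block (not the paper's exact module), the sketch below applies adaptive instance normalization to per-point content features so their statistics match those of a 2D style feature map; all shapes and names are assumptions for illustration.

```python
import torch

def adain_pointwise(content_feat, style_feat, eps=1e-5):
    """Match the channel statistics of per-point content features (B, N, C)
    to those of a style feature map (B, H, W, C)."""
    b, _, c = content_feat.shape
    style_flat = style_feat.reshape(b, -1, c)
    c_mean = content_feat.mean(dim=1, keepdim=True)
    c_std = content_feat.std(dim=1, keepdim=True) + eps
    s_mean = style_flat.mean(dim=1, keepdim=True)
    s_std = style_flat.std(dim=1, keepdim=True) + eps
    return (content_feat - c_mean) / c_std * s_std + s_mean

# Per-point content features for 1024 scene points, style features from a 2D image.
stylized = adain_pointwise(torch.randn(2, 1024, 32), torch.randn(2, 16, 16, 32))
```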